Biomimetic Multi-Resolution Analysis for Robust Speaker Recognition
Abstract
Humans exhibit a remarkable ability to reliably classify sound sources in the environment even in the presence of high levels of noise. In contrast, most engineering systems suffer a drastic drop in performance when speech signals are corrupted with channel or background distortions. Our brains are equipped with elaborate machinery for speech analysis and feature extraction, which holds great lessons for improving the performance of automatic speech processing systems under adverse conditions. The work presented here explores a biologically-motivated multi-resolution speaker information representation obtained by performing an intricate yet computationally efficient analysis of the information-rich spectro-temporal attributes of the speech signal. We evaluate the proposed features in a speaker verification task performed on NIST SRE 2010 data. The biomimetic approach yields significant robustness in the presence of non-stationary noise and reverberation, offering a new framework for deriving reliable features for speaker recognition and speech processing.

Introduction
In addition to the intended message, the human voice carries the unique imprint of a speaker. Just like fingerprints and faces, voice prints are biometric markers with tremendous potential for forensic, military, and commercial applications [1]. However, despite enormous advances in computing technology over the last few decades, automatic speaker verification (ASV) systems still rely heavily on training data collected in controlled environments, and most systems face a rapid degradation in performance when operating under previously unseen conditions (e.g., channel mismatch, environmental noise, or reverberation). In contrast, human perception of speech and the ability to identify sound sources (including voices) remain quite remarkable even at relatively high distortion levels [2]. Consequently, the pursuit of human-like recognition capabilities has spurred great interest in understanding how humans perceive and process speech signals.

One of the intriguing processes taking place in the central auditory system involves ensembles of neurons with variable tuning to the spectral profiles of acoustic signals. In addition to the frequency (tonotopic) organization emerging as early as the cochlea, neurons in the central auditory system (specifically in the midbrain and, more prominently, in the auditory cortex) exhibit tuning to a variety of filter bandwidths and shapes [3]. This elegant neural architecture provides a detailed multi-resolution analysis of the spectral sound profile, which is presumably relevant to speech and speaker recognition. Only a few studies so far have attempted to use this cortical representation in speech processing, yielding some improvements for automatic speech recognition at the expense of substantial computational complexity [4,5]. To the best of our knowledge, no similar work has been done for ASV.

In the present report, we explore the use of a multi-resolution analysis for robust speaker verification. Our representation is simple, effective, and computationally efficient. The proposed scheme is carefully optimized to be particularly sensitive to the information-rich spectro-temporal attributes of the signal while maintaining robustness to unseen noise distortions.
The choice of model parameters builds on our current knowledge of psychophysical principles of speech perception in noise [6,7], complemented with a statistical analysis of the dependencies between spectral details of the message and speaker information. We evaluate the proposed features in an ASV system and compare it against one of the best performing systems in the NIST 2010 SRE evaluation [8] under detrimental conditions such as white noise, non-stationary additive noise, and reverberation.

The following section describes the details of the proposed multi-resolution spectro-temporal model. It is followed by an analysis that motivates the choice of model parameters to maximize speaker information retention. Next, we describe the experimental setup and results. We finish with a discussion of these results and comment on potential extensions towards achieving further noise robustness.

The biomimetic multi-resolution analysis
An overview of the processing chain described in this section is presented in Figure 1.

Peripheral analysis
The speech signal is processed through a pre-emphasis stage (implemented as a first-order high-pass filter with pre-emphasis coefficient 0.97), and a time-frequency auditory spectrogram is generated using a biomimetic sound processing model described in detail in [9] and briefly summarized here (Equation 1). First, the signal s(t) undergoes a cochlear frequency analysis modeled by a bank of 128 constant-Q (Q = 4), highly asymmetric bandpass filters h(t; f) equally spaced over a span of 5⅓ octaves on a logarithmic frequency axis. The filterbank output is a spatiotemporal pattern of cochlear basilar-membrane displacements y_coch(t, f) over 128 channels. Next, a lateral inhibitory network detects discontinuities in the responses across the tonotopic (frequency) axis, resulting in a further enhancement of the filterbank's frequency selectivity. This step is modeled as a first-order differentiation operation across the channel array, followed by half-wave rectification and short-term integration. The temporal integration window is given by μ(t; τ) = e^{−t/τ} u(t), with time constant τ = 10 ms, mimicking the further loss of phase-locking observed in the midbrain. This time constant controls the frame rate of the spectral vectors. Finally, a nonlinear cubic-root compression of the spectrum is performed, resulting in an auditory spectrogram y(t, f):

\[
\begin{aligned}
y_{\mathrm{coch}}(t,f) &= s(t) \otimes_t h(t;f), \\
y_{\mathrm{lin}}(t,f) &= \max\big(\partial_f\, y_{\mathrm{coch}}(t,f),\, 0\big), \\
y(t,f) &= \big[\, y_{\mathrm{lin}}(t,f) \otimes_t \mu(t;\tau) \,\big]^{1/3},
\end{aligned}
\tag{1}
\]

where \otimes_t represents convolution with respect to time. The choice of the auditory spectrogram is motivated by its neurophysiological foundation as well as its proven self-normalization and robustness properties (see [10] for full details).
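To make the peripheral stage concrete, the sketch below walks through Equation 1 in Python. It is a rough approximation under stated assumptions, not the authors' implementation: the highly asymmetric constant-Q cochlear filters of [9] are replaced with simple Butterworth band-pass filters of the same Q, and the sampling rate (16 kHz), lowest channel frequency (180 Hz), and 10-ms frame hop are assumed values chosen only for illustration.

```python
"""Rough sketch of the peripheral auditory-spectrogram stage (Equation 1).

Assumptions (not from the paper): Butterworth band-pass filters stand in for
the asymmetric constant-Q cochlear filters of [9]; fs = 16 kHz; the lowest
channel is centred near 180 Hz; frames are taken every 10 ms.
"""
import numpy as np
from scipy.signal import butter, lfilter

def auditory_spectrogram(x, fs=16000, n_chan=128, f_lo=180.0,
                         chan_per_oct=24, q=4.0, tau=0.010, frame_ms=10):
    x = np.asarray(x, dtype=float)

    # Pre-emphasis: first-order high-pass with coefficient 0.97.
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])

    # Constant-Q filterbank: 128 channels, 24 per octave (~5 1/3 octaves).
    centres = f_lo * 2.0 ** (np.arange(n_chan) / chan_per_oct)
    y_coch = np.empty((n_chan, len(x)))
    for k, fc in enumerate(centres):
        bw = fc / q                                    # constant Q = 4
        lo, hi = max(fc - bw / 2, 1.0), min(fc + bw / 2, fs / 2 - 1)
        b, a = butter(2, [lo, hi], btype='band', fs=fs)
        y_coch[k] = lfilter(b, a, x)

    # Lateral inhibition: first difference across channels, half-wave rectified.
    y_lin = np.maximum(np.diff(y_coch, axis=0, prepend=y_coch[:1]), 0.0)

    # Leaky temporal integration with tau = 10 ms, then frame-rate sampling.
    alpha = np.exp(-1.0 / (tau * fs))
    y_int = lfilter([1 - alpha], [1, -alpha], y_lin, axis=1)
    hop = int(frame_ms * 1e-3 * fs)
    frames = y_int[:, hop - 1::hop]

    # Cubic-root compression yields the auditory spectrogram y(t, f).
    return np.cbrt(frames)             # shape: (128 channels, n_frames)
```

Under these assumptions the output has the form expected downstream: one 128-dimensional, cubic-root compressed spectral slice every 10 ms.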
Spectral cortical analysis
The auditory spectrogram is processed further in order to capture the spectral details present in each spectral slice. The processing is based on neurophysiological findings that neurons in the central auditory pathway are tuned not only to frequencies but also to spectral shapes, in particular to peaks of various widths on the log-frequency axis [3,11,12]. The spectral width is characterized by a parameter called scale, measured in cycles per octave (CPO). Physiological data indicate that auditory cortex neurons are highly scale-selective, thus expanding the one-dimensional cochlear tonotopic axis onto a two-dimensional sheet that explicitly encodes tonotopy as well as spectral shape details (see Figures 1 and 2).

The cortical analysis is implemented using a bank of modulation filters operating in the Fourier domain. The algorithm processes each data frame individually. The Fourier transform of each spectral slice y(t_0, f) is multiplied by a modulation filter H_S(Ω; Ω_c) tuned to spectral features of scale Ω_c. The filtering operates on the magnitude of the signal. After filtering, the inverse Fourier transform is performed and the real part is taken as the new filtered slice. This process is then repeated with a number of different Ω_c, yielding a set of filtered spectrograms y(t, f; Ω_c), each with features of scale Ω_c emphasized (see Figure 1). This set of spectrograms constitutes the spectral cortical representation of the sound.

Figure 1: An outline of the cortical feature extraction algorithm. A schematic diagram of the algorithm that transforms a speech waveform into a sequence of cortical feature vectors.

Figure 2: Details of the speech spectral analysis. (a) The speech spectrogram is analyzed separately at each time instant. Each spectrogram slice is filtered through a bandpass filter H_S(Ω; Ω_c) parameterized by Ω_c; the * operator signifies the filtering operation. Four such filtering operations yield four views of the same spectral slice; each view highlights different details about the spectrum, notably formant peaks and harmonic structure. (b) Cortical features for clean and noisy versions of the phoneme /ow/. The plots show magnitude as a function of frequency and scale; for visualization, the discrete image points have been interpolated in MATLAB using a bicubic interpolation routine. Notice the consistency of the formant peaks around 1 and 4 kHz and of the harmonic energies at 2 CPO and 4 CPO despite the additive noise distortion. (c) Cortical features for different types of additive noise. Note that the patterns exhibited are quite different; subtle peaks due to the harmonicity and formant structure of human speech can be seen in the left panel (babble noise).

The filter H_S(Ω; Ω_c) is defined as

\[
H_S(\Omega; \Omega_c) = \left(\frac{\Omega}{\Omega_c}\right)^{2} e^{\,1 - \Omega/\Omega_c}, \qquad 0 \le \Omega \le \Omega_{\max},
\tag{2}
\]

where Ω_max is the highest spectral modulation frequency (set at 12 CPO, given our spectrogram resolution of 24 channels per octave).

Choice of spectral parameters
The set of scales Ω_c is chosen by dividing the spectral modulation axis into equal-energy regions using a training corpus (the TIMIT database [13]), as described below. Define the average spectral modulation profile \bar{Y}(\Omega) = ⟨⟨|Y(Ω; t_0)|⟩_T⟩ as the ensemble mean of the magnitude Fourier transform of the spectral slice y(t_0, f), averaged over all times T and over the entire speech corpus. The resulting ensemble profile (shown in Figure 3a) is then divided into M equal-energy regions Ω_k.
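A companion sketch of the spectral cortical stage is given below: each spectral slice of the auditory spectrogram is Fourier transformed along the (log-)frequency axis, multiplied by the scale filter H_S(Ω; Ω_c) of Equation 2, and inverse transformed, keeping the real part. The scale centres used here are illustrative placeholders, not the equal-energy values derived from TIMIT in the text, and the symmetric treatment of negative modulation frequencies is an implementation assumption.

```python
"""Sketch of the per-frame spectral (scale) analysis of Equation 2.

Assumptions (not from the paper): the scale filter is applied symmetrically to
positive and negative spectral-modulation frequencies, and the scale centres
below are illustrative rather than the paper's equal-energy values.
"""
import numpy as np

def scale_filter(n_bins, omega_c, chan_per_oct=24):
    """H_S(Omega; Omega_c) evaluated on the FFT grid of one spectral slice."""
    # Spectral-modulation axis in cycles per octave (CPO); with 24 channels
    # per octave the highest representable modulation is 12 CPO.
    omega = np.abs(np.fft.fftfreq(n_bins, d=1.0 / chan_per_oct))
    return (omega / omega_c) ** 2 * np.exp(1.0 - omega / omega_c)

def cortical_features(aud_spec, scales=(0.5, 1.0, 2.0, 4.0)):
    """Filter every spectral slice of an auditory spectrogram at each scale.

    aud_spec : array of shape (n_channels, n_frames), e.g. 128 x T.
    Returns an array of shape (len(scales), n_channels, n_frames).
    """
    n_chan, n_frames = aud_spec.shape
    out = np.empty((len(scales), n_chan, n_frames))

    # FFT along the (log-)frequency axis of each time slice.
    slice_fft = np.fft.fft(aud_spec, axis=0)
    for i, omega_c in enumerate(scales):
        h = scale_filter(n_chan, omega_c)
        # Multiply by the (real, non-negative) scale filter, invert, keep real part.
        out[i] = np.real(np.fft.ifft(slice_fft * h[:, None], axis=0))
    return out
```

Stacking the outputs over the chosen scales yields, for every frame, a scale-by-frequency pattern analogous to the four filtered views of a spectral slice shown in Figure 2a.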
Similar resources
Recognizing the message and the messenger: biomimetic spectral analysis for robust speech and speaker recognition
Humans are quite adept at communicating in the presence of noise. However, most speech processing systems, like automatic speech and speaker recognition systems, suffer from a significant drop in performance when speech signals are corrupted with unseen background distortions. The proposed work explores the use of a biologically-motivated multi-resolution spectral analysis for speech representation....
Robust speaker verification using short-time frequency with long-time window and fusion of multi-resolutions
This study presents a novel approach to feature analysis for speaker verification. There are two main contributions in this paper. First, the feature analysis of short-time frequency with long-time window (SFLW) is a compact feature for efficient speaker verification. The purpose of SFLW is to take account of short-time frequency characteristics and long-time resolution at the same time. ...
Study on Text-Dependent Speaker Recognition Based on Biomimetic Pattern Recognition
We studied the application of Biomimetic Pattern Recognition to speaker recognition. A speaker recognition neural network using network matching degree as the criterion is proposed. It has been used in a text-dependent speaker recognition system. Experimental results show that good results can be obtained even with fewer samples. Furthermore, the misrecognition caused by untrained speakers oc...
Convolutional neural network with adaptable windows for speech recognition
Although speech recognition systems are widely used and their accuracies continuously increase, there is a considerable performance gap between their accuracies and human recognition ability. This is partially due to high speaker variations in the speech signal. Deep neural networks are among the best tools for acoustic modeling. Recently, using hybrid deep neural network and hidden Markov mo...
Biomimetic Pattern Recognition
In speaker-independent speech recognition, the disadvantage of the most widely used technology (Hidden Markov Models) is not only the need for many more training samples but also the long training time required. This paper describes the use of Biomimetic Pattern Recognition (BPR) in recognizing some Mandarin speech in a speaker-independent manner. The vocabulary of the system consists of 15 Chinese d...
Journal title: EURASIP J. Audio, Speech and Music Processing
Volume: 2012, Issue: -
Pages: -
Publication date: 2012